Implement a single-layer fully connected neural network model to classify MNIST images. It takes a raw image as a 1D vector of length 784 = 28x28 and outputs a vector of length 10 (each dimension corresponds to a class). You may want to add a bias term to the input, but that is optional for this assignment. The output is connected to the input simply by a set of linear weights, and the Softmax function is applied to the output as the non-linearity. Softmax is a simple function that converts an n-dimensional input to an n-dimensional output where the output sums to one and each element is a value between 0 and 1. It is defined as
$$\sigma(z)_i = \frac{exp(z_i)}{\sum_{j=1}^n exp(z_j)}$$When we apply this function to the output of the network, $o$, it predicts a vector which can be seen as the probability of each category given the input $x$:
$$P(c_i|x) = \frac{exp(o_i)}{\sum^n_{j=1} exp(o_j )}$$where $n$ is the number of categories, 10, in our case. We want the $i$’th output to mimic $P(c_i|x)$, the probability of the input $x$ belonging to the category $i$. We can represent the desired probability distribution as the vector $gt$ where $gt(i)$ is one only if the input is from the $i$’th category and zero otherwise. This is called one-hot encoding. Assuming $x$ is from $y$’th category, $gt(y)$ is the only element in $gt$ that is equal to one. Then, we want the output probability distribution to be similar to the desired one (ground-truth). Hence, we use cross-entropy loss to compare these two probability distributions, $P$ and $gt$:
$$L(x,y,w) = \sum_{i=1}^n - gt(i) log(P(c_i|x))$$where $n$ is the number of categories. Since $gt$ is one hot encoding, we can remove the terms of $gt$ that are zero, keeping only the $y$’th term. Since $gt(y) = 1$, we can remove it in the multiplication to come up with the following loss which is identical to the above one:
$$L(x,y,w) = - log(P(c_y|x))$$This is the loss for one input only, so the total loss on a mini-batch is:
$$L = \sum_{k=1}^N -log(P(c_{yk}|x_k)) $$where $N$ is the size of mini-batch, number of training data being processed at this iteration.
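The mini-batch loss above can be sanity-checked directly in NumPy before writing any training code. The probabilities and labels below are hypothetical values, chosen only to illustrate the indexing:

```python
import numpy as np

# toy batch: n = 3 classes, N = 2 samples (one column per sample)
P = np.array([[0.7, 0.1],   # P(c_i | x_k); each column sums to 1
              [0.2, 0.8],
              [0.1, 0.1]])
y = np.array([0, 1])        # ground-truth class index y_k of each sample

# L = sum_k -log P(c_{y_k} | x_k): pick the true-class probability per column
L = -np.sum(np.log(P[y, np.arange(P.shape[1])]))
```

Only the probability assigned to the true class of each sample enters the loss, which is exactly the simplification obtained from the one-hot $gt$ above.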
Please implement the stochastic gradient descent (SGD) algorithm from scratch to train the model. You may use NumPy, but should not use PyTorch, TensorFlow, or any similar deep learning framework. Use a mini-batch of 10 images per iteration. Then, train it on the full MNIST training set and plot the accuracy on all test data every n iterations. Please choose n small enough so that the graph shows the progress of learning and large enough so that testing does not take a lot of time. You may use a smaller n initially and then increase it gradually as the learning progresses. Choose a learning rate so that the loss goes down.
Answer:
Using training data with $60{,}000$ instances and a mini-batch of size $10$ for every iteration means that we have a total of $\frac{\text{instances}}{\text{batch size}} = \frac{60{,}000}{10} = 6{,}000$ batches, i.e. iterations, per epoch.
Since this is a multi-class classification problem, we want to minimize the cross-entropy loss function: $$L(Y, \hat{Y})= - \sum_{i=1}^n y_i \cdot log(\hat{y_i}) $$
where $n= 10$ is the number of categories, $y_i =$entries of the ground truth label $Y$ (one-hot encoded), and $\hat{y_i} =$ entries of the prediction vector $\hat{Y}$
If we take the average of the loss function over training samples of batch size $m$, we have
$$L(Y, \hat{Y})= - \frac{1}{m} \sum_{j=1}^m \sum_{i=1}^n y_i^{(j)} \cdot log(\hat{y}_i^{(j)})$$where $y_i^{(j)}$ and $\hat{y}_i^{(j)}$ are the entries for the $j$'th sample. The prediction vector $\hat{Y}$ can be expressed as a vector of probabilities that sums up to 1. These probabilities can be obtained using the softmax function:
$$s_i = \frac{e^{z_i}}{\sum_{k=0}^9 e^{z_k}}$$where the vector $z$ is the input.
Forward propagation:
$z = w^TX +b $
Hence, with $\sigma$ denoting the Softmax function,
$s = \sigma (z) \ , \ \ s_i = \frac{e^{(w^TX +b)_i}}{\sum_{k=0}^9 e^{(w^TX +b)_k}} \ \ \ \ (1) $
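A minimal sketch of this forward pass in NumPy; the shapes and random initialization here are illustrative assumptions, not the final training setup:

```python
import numpy as np

def softmax(z):
    # shift by the column max for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z, axis=0))
    return e / np.sum(e, axis=0)

rng = np.random.default_rng(0)
w = rng.normal(size=(784, 10)) * np.sqrt(1.0 / 784)  # weights
b = np.zeros((10, 1))                                # bias
X = rng.random((784, 5))                             # 5 fake "images" as columns

z = w.T @ X + b   # (10, 5) scores
s = softmax(z)    # (10, 5) class probabilities; each column sums to 1
```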
We then use backpropagation to adjust the weights and biases based on the computed loss so that the loss will be lower in the next iteration.
Before implementing the backpropagation step, we need the gradient of the loss with respect to the weighted input $z$ of the output layer, derived as follows. We let $\hat{y}_i = s_i$, the softmax output.
$\frac{\partial L}{\partial z_j} = - \frac{\partial }{\partial z_j} \sum_{i=1}^n y_i \cdot log(s_i) $
$ \ \ \ \ \ \ = - \sum_{i=1}^n y_i \cdot \frac{\partial }{\partial z_j} log(s_i) $
$ \ \ \ \ \ \ = - \sum_{i=1}^n y_i \cdot \frac{\partial }{\partial z_j} log\Big(\frac{e^{z_i}}{\sum_{k=1}^n e^{z_k}} \Big) $
$ \ \ \ \ \ \ = - \sum_{i=1}^n \frac{y_i}{s_i} \cdot \frac{\partial s_i}{\partial z_j} \ , \ $ with $\frac{\partial s_i}{\partial z_j}$ derived below
$ \ \ \ \ \ \ = - \sum_{i=1}^n \frac{y_i}{s_i} \cdot s_i \Big[ \mathbb{1} \{i=j\} - s_j \Big] $
$ \ \ \ \ \ \ = - \sum_{i=1}^n {y_i} \Big[ \mathbb{1} \{i=j\} - s_j \Big] $
$ \ \ \ \ \ \ = - \sum_{i=1}^n {y_i} \cdot \mathbb{1} \{i=j\} + \sum_{i=1}^n {y_i} \cdot s_j $
$ \ \ \ \ \ \ = - y_j + s_j \sum_{i=1}^n {y_i} \ \ , \ \ $ $\sum_{i=1}^n {y_i} = 1$ since the one-hot vector $Y$ sums to $1$
$ \ \ \ \ \ \ = s_j - y_j $
$ \frac{\partial L}{\partial z} = s - y $
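This compact result can be verified with a central-difference check. The sketch below (hypothetical logits, one sample) compares the analytic gradient $s - gt$ against a numerical gradient of the loss:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    # cross-entropy of softmax(z) against the true class index y
    return -np.log(softmax(z)[y])

rng = np.random.default_rng(1)
z = rng.normal(size=10)  # hypothetical logits for one sample
y = 3                    # hypothetical true class

analytic = softmax(z).copy()
analytic[y] -= 1.0       # s - gt, with gt the one-hot vector

# central finite differences in each coordinate
eps = 1e-6
numeric = np.zeros(10)
for j in range(10):
    e_j = np.zeros(10)
    e_j[j] = eps
    numeric[j] = (loss(z + e_j, y) - loss(z - e_j, y)) / (2 * eps)
```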
$ \ \ \ \ \ \ $
$\frac{\partial z}{\partial w_j} = \frac{\partial }{\partial w_j}\big(w^TX +b\big)$
$ \ \ \ \ \ \ \ = \frac{\partial }{\partial w_j}(w_0x_0 + \cdots + w_nx_n +b)$
$ \ \ \ \ \ \ \ = x_j$
$ \ \ \ \ \ \ $
For $i = j$:
$\frac{\partial s_i}{\partial z_j} = \frac{\partial }{\partial z_j} \frac{e^{z_i}}{\sum_{k=1}^n e^{z_k}}$
$ \ \ \ \ \ = \frac{e^{z_i} \sum_{k=1}^n e^{z_k} - e^{z_j}e^{z_i}}{\left( \sum_{k=1}^n e^{z_k}\right)^2} $
$ \ \ \ \ \ = \frac{e^{z_i} \left( \sum_{k=1}^n e^{z_k} - e^{z_j}\right )}{\left( \sum_{k=1}^n e^{z_k}\right)^2} $
$ \ \ \ \ \ = \frac{ e^{z_i} }{\sum_{k=1}^n e^{z_k} } \times \frac{\left( \sum_{k=1}^n e^{z_k} - e^{z_j}\right ) }{\sum_{k=1}^n e^{z_k} }$
$ \ \ \ \ \ = s_i(1-s_j) \ \ , \ $ for $i = j$
$ \ \ \ \ \ \ $
For $i \ne j$, the numerator $e^{z_i}$ does not depend on $z_j$:
$\frac{\partial s_i}{\partial z_j} = \frac{0 - e^{z_j}e^{z_i}}{\left( \sum_{k=1}^n e^{z_k}\right)^2}$
$ \ \ \ \ \ = \frac{- e^{z_j} }{\sum_{k=1}^n e^{z_k} } \times \frac{e^{z_i} }{\sum_{k=1}^n e^{z_k} }$
$ \ \ \ \ \ = - s_j \cdot s_i \ \ , \ $ for $i \ne j$
Hence, the two cases combine into
$\frac{\partial s_i}{\partial z_j} = s_i \big( \mathbb{1} \{i=j\} - s_j \big)$
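The two cases together say the softmax Jacobian is $\mathrm{diag}(s) - ss^T$. A quick numerical check of this claim, using illustrative values only:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(2)
z = rng.normal(size=4)
s = softmax(z)

# analytic Jacobian: J[i, j] = s_i * (1{i==j} - s_j)
J = np.diag(s) - np.outer(s, s)

# numerical Jacobian by central differences
eps = 1e-6
J_num = np.zeros((4, 4))
for j in range(4):
    e_j = np.zeros(4)
    e_j[j] = eps
    J_num[:, j] = (softmax(z + e_j) - softmax(z - e_j)) / (2 * eps)
```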
$ \ \ \ \ \ \ $
To see how $L$ changes w.r.t. each component $w_j$ of $w$, we apply the chain rule. Since $\frac{\partial L}{\partial z} = s - y$ already contains the softmax derivative, we only need $\frac{\partial z}{\partial w_j} = x_j$ from above:
$\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial z} \frac{\partial z}{\partial w_j}$
$ \ \ \ \ \ \ = (s-y) \, x_j$
In vectorized form, averaged over a mini-batch of size $m$, this is
$\frac{\partial L}{\partial w} = \frac{1 }{m} (s-y) X^T \ \ , \ $ where $s$ holds the predicted probabilities
$ \ \ \ \ \ \ $
Similarly,
$\frac{\partial z}{\partial b} = \frac{\partial }{\partial b}\big(w^TX +b\big)$
$ \ \ \ \ \ \ \ = \frac{\partial }{\partial b}(w_0x_0 + \cdots + w_nx_n +b)$
$ \ \ \ \ \ \ \ = 1$
Thus,
$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z} \frac{\partial z}{\partial b}$
$ \ \ \ \ \ \ = s - y$
In vectorized form this is
$\frac{\partial L}{\partial b} = \frac{1 }{m} \sum_{i=1}^m(s-y)$
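Both vectorized gradients can be sketched together for a toy batch, with the shapes used in the derivation ($s$ and $y$ are $10 \times m$, $X$ is $784 \times m$); all values here are made up:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 4                                        # toy batch size
X = rng.random((784, m))                     # inputs, one column per sample
y = np.eye(10)[:, rng.integers(0, 10, m)]    # one-hot labels, (10, m)
z = rng.normal(size=(10, m))                 # hypothetical scores
e = np.exp(z - z.max(axis=0))
s = e / e.sum(axis=0)                        # softmax probabilities, (10, m)

dz = s - y                                       # dL/dz, (10, m)
dw = (1.0 / m) * dz @ X.T                        # dL/dw, (10, 784)
db = (1.0 / m) * dz.sum(axis=1, keepdims=True)   # dL/db, (10, 1)
```

Because each column of $s$ and of $y$ sums to one, the columns of `dz` sum to zero.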
$ \ \ \ \ \ \ $
To adjust the weights in each iteration we use the gradient descent update rule with fixed step size $\alpha$: $w^{(k+1)} = w^{(k)} - \alpha \nabla L(w^{(k)})$
so the new updated parameters for a single layer is
$w_1^{new} = w_1 - \alpha \frac{\partial L}{\partial w_1} $
$b_1^{new} = b_1 - \alpha \frac{\partial L}{\partial b_1} $
where $L$ is the average loss over the mini-batch
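One update step under this rule looks as follows; the gradients here are constant placeholders for illustration, not real backprop outputs:

```python
import numpy as np

alpha = 0.1                      # fixed step size
w = np.zeros((10, 784))          # current weights
b = np.zeros((10, 1))            # current bias
dw = np.full((10, 784), 0.5)     # placeholder gradient dL/dw
db = np.full((10, 1), 0.5)       # placeholder gradient dL/db

w_new = w - alpha * dw           # w^{new} = w - alpha * dL/dw
b_new = b - alpha * db           # b^{new} = b - alpha * dL/db
```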
Reference(s): https://towardsdatascience.com/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1
# Import dependencies
import pandas as pd
import numpy as np
from scipy.io import loadmat
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
pio.renderers.default='notebook'
#one hot encoding function
def one_hot_enc(Y):
    t = np.zeros((Y.shape[0], 10))
    for i in range(Y.shape[0]):
        t[i][int(Y[i][0])] = 1
    return t
#normalization function
def normalize(X):
    X = X / 255 #divide by 255 since each pixel value is a grayscale integer between 0 and 255
    return X
#Load dataset
M = loadmat(r'..\HW1\data\MNIST_digit_data.mat') #raw string so the backslashes are not treated as escape sequences
#assign labels and pre-normalized images from dictionary into individual arrays
images_train,images_test,labels_train,labels_test= M['images_train'],M['images_test'],M['labels_train'],M['labels_test']
#set random seed
np.random.seed(1)
#randomly permute data points
inds = np.random.permutation(images_train.shape[0])
images_train = images_train[inds]
labels_train = labels_train[inds]
inds = np.random.permutation(images_test.shape[0])
images_test = images_test[inds]
labels_test = labels_test[inds]
#one hot encode labels
labels_train_enc = one_hot_enc(labels_train).astype(float)
labels_test_enc = one_hot_enc(labels_test).astype(float)
#transpose the training and test arrays once here so the
#following functions do not have to transpose them
X_train = images_train.T
Y_train = labels_train_enc.T
X_test = images_test.T
Y_test = labels_test_enc.T
#display final test and train data sizes
print(X_train.shape, Y_train.shape)
print(X_test.shape,Y_test.shape)
(784, 60000) (10, 60000) (784, 10000) (10, 10000)
#softmax function
def softmax(z):
    """
    Computes softmax of the scores
    Inputs:
    - z: result of W^T.X+b
    """
    z = z - np.max(z, axis=0) #shift by the column max for numerical stability; softmax is unchanged
    return np.exp(z) / np.sum(np.exp(z), axis=0)
#loss function (cross entropy loss)
def loss(Y, Y_pred):
    """
    Computes average cross-entropy loss
    Inputs:
    - Y: ground-truth labels (one-hot encoded)
    - Y_pred: predicted probabilities from softmax
    """
    log_sum = -np.sum(np.multiply(Y, np.log(Y_pred))) #negative log-likelihood summed over the batch
    m = Y.shape[1] #number of samples
    L = (1./m) * log_sum #average over the m samples
    return L
#accuracy function
def accuracy(labels, predictions):
    """
    Computes classification accuracy in percent
    Inputs:
    - labels: actual labels (one-hot encoded)
    - predictions: predicted probabilities
    """
    total_correct = 0
    m = labels.shape[1] #number of instances
    predictions = np.argmax(predictions, axis=0) #index of the largest probability = predicted label
    labels = np.argmax(labels, axis=0) #index of the entry equal to one = actual label
    #compare indices and count how many match
    for i in range(len(labels)):
        if predictions[i] == labels[i]:
            total_correct += 1
    return total_correct / m * 100
#forward propagation function
def forward_fc(X, params):
    """
    Computes the forward pass for an affine fully-connected layer
    Inputs:
    - X: input image batch (dimension: 784 x batch size)
    - params: weights (dimension: 10x784) and bias (dimension: 10x1)
    """
    dict_f = {} #initialize empty dictionary for forward results
    #input layer to l1: z1 = w1*x + b1
    dict_f['z1'] = np.matmul(params['w1'], X) + params['b1'] #scores
    dict_f['l1'] = softmax(dict_f['z1']) #class probabilities via softmax
    return dict_f
#backward propagation function
def backward_fc(X, Y, params, dict_f, batch_size):
    """
    Computes the backward pass for an affine fully-connected layer
    Inputs:
    - X: input image batch (dimension: 784 x batch size)
    - Y: input label batch (dimension: 10 x batch size)
    - params: weights (dimension: 10x784) and bias (dimension: 10x1)
    - dict_f: output of the forward pass
    - batch_size: size of every batch
    """
    #initialize empty dictionary for results
    dict_b = {}
    #compute derivatives of the loss wrt z, w, and b
    dz1 = dict_f['l1'] - Y #dL/dz = s - y
    dict_b['dw1'] = (1./batch_size) * np.matmul(dz1, X.T) #weight gradient
    dict_b['db1'] = (1./batch_size) * np.sum(dz1, axis=1, keepdims=True) #bias gradient
    return dict_b
#single-layer fully connected network with mini batch SGD
def mini_batch_fc(X_train, Y_train, X_test, Y_test, input_size = 28*28, output_size = 10, batch_size = 10, epoch = 3, rand_sample = True):
    np.random.seed(11)
    #initialize
    m = X_train.shape[1] #number of training samples
    learning_rate = 0.1
    batches = int(m / batch_size) #number of batches per epoch
    #initialize empty result lists
    results_per_iter = []
    results_per_epoch = []
    #initialize parameters
    #scale weights by sqrt(1/n) to set the initial variance to 1/n; biases start at zero
    params = { "w1": np.random.randn(output_size, input_size) * np.sqrt(1. / input_size),
               "b1": np.zeros((output_size, 1)) }
    #loop through epochs
    for n in range(epoch):
        print("*** Epoch {} ***".format(n))
        if rand_sample:
            #randomly permute column indices
            indices = np.random.permutation(m)
            X = X_train[:, indices]
            Y = Y_train[:, indices]
        else:
            X = X_train
            Y = Y_train
        #iteration counter
        count = 0
        #iterate through mini-batches of the m training samples
        for i in range(0, m, batch_size):
            #assign i-th batch to variables
            X_i = X[:, i:i+batch_size]
            Y_i = Y[:, i:i+batch_size]
            #perform forward and backward pass
            dict_f = forward_fc(X_i, params)
            dict_b = backward_fc(X_i, Y_i, params, dict_f, batch_size)
            #update parameters using the GD update rule with a fixed learning rate
            params["w1"] = params["w1"] - learning_rate * dict_b["dw1"]
            params["b1"] = params["b1"] - learning_rate * dict_b["db1"]
            #evaluate on the full training set
            dict_f = forward_fc(X_train, params)
            train_loss = loss(Y_train, dict_f["l1"])
            #evaluate on the full test set
            dict_f = forward_fc(X_test, params)
            test_loss = loss(Y_test, dict_f["l1"])
            acc = accuracy(Y_test, dict_f["l1"])
            #update counter
            count += 1
            #save results
            iter_results = {'iteration': count, 'train_loss': train_loss, 'test_loss': test_loss, 'test_accuracy': acc}
            results_per_iter.append(iter_results)
            #display results six times per epoch
            c = batches / 6
            if count % c == 0:
                print("Training {}: training loss = {}, test loss = {}, test accuracy = {} ".format(count, train_loss, test_loss, acc))
        #end-of-epoch evaluation on the training set
        dict_f = forward_fc(X_train, params)
        train_loss = loss(Y_train, dict_f["l1"])
        #and on the test set
        dict_f = forward_fc(X_test, params)
        test_loss = loss(Y_test, dict_f["l1"])
        acc = accuracy(Y_test, dict_f["l1"])
        #save results
        results = {'epoch': n+1, 'train_loss': train_loss, 'test_loss': test_loss, 'test_accuracy': acc}
        results_per_epoch.append(results)
        print("Training done!")
    return results_per_iter, results_per_epoch, dict_f, dict_b, params
%%time
results_per_iter, results_per_epoch, dict_f, dict_b, params = mini_batch_fc(X_train,Y_train, X_test, Y_test, input_size = 28*28, output_size = 10, batch_size = 10, epoch = 3, rand_sample = True)
*** Epoch 0 *** Training 1000: training loss = 0.37615839867988443, test loss = 0.3599463301678172, test accuracy = 89.96 Training 2000: training loss = 0.3269326732542967, test loss = 0.3144352901435516, test accuracy = 91.10000000000001 Training 3000: training loss = 0.3287269554577137, test loss = 0.3255100510059334, test accuracy = 90.67 Training 4000: training loss = 0.3204995438744067, test loss = 0.31679168268565683, test accuracy = 90.85 Training 5000: training loss = 0.306562028891086, test loss = 0.30231847903932185, test accuracy = 91.06 Training 6000: training loss = 0.2983037603144213, test loss = 0.2956442065832815, test accuracy = 91.57 Training done! *** Epoch 1 *** Training 1000: training loss = 0.29257699723112746, test loss = 0.2930958036275988, test accuracy = 91.8 Training 2000: training loss = 0.30194958275611844, test loss = 0.3059924591959284, test accuracy = 91.34 Training 3000: training loss = 0.2825303936437372, test loss = 0.28377832939147624, test accuracy = 91.74 Training 4000: training loss = 0.28440182640761935, test loss = 0.28700242833715667, test accuracy = 91.96 Training 5000: training loss = 0.2814790892559006, test loss = 0.2857230830952564, test accuracy = 92.05 Training 6000: training loss = 0.2789279010135515, test loss = 0.28799861825356926, test accuracy = 92.0 Training done! 
*** Epoch 2 *** Training 1000: training loss = 0.2805534258884522, test loss = 0.28771444650791445, test accuracy = 92.0 Training 2000: training loss = 0.2915605419481089, test loss = 0.29457788364437537, test accuracy = 91.73 Training 3000: training loss = 0.2760029088285485, test loss = 0.2846294907466463, test accuracy = 92.12 Training 4000: training loss = 0.2942687443367019, test loss = 0.30076067154865577, test accuracy = 91.28 Training 5000: training loss = 0.2756507231819271, test loss = 0.2865742773133869, test accuracy = 91.86 Training 6000: training loss = 0.27494525432394273, test loss = 0.2845414392539408, test accuracy = 92.21000000000001 Training done! CPU times: total: 1h 25min 38s Wall time: 16min 31s
df_iter = pd.DataFrame.from_dict(results_per_iter)
df_iter
| iteration | train_loss | test_loss | test_accuracy | |
|---|---|---|---|---|
| 0 | 1 | 2.317132 | 2.323152 | 15.12 |
| 1 | 2 | 2.248185 | 2.254529 | 21.19 |
| 2 | 3 | 2.133389 | 2.133171 | 24.72 |
| 3 | 4 | 2.050330 | 2.056600 | 19.90 |
| 4 | 5 | 1.897787 | 1.895859 | 38.00 |
| ... | ... | ... | ... | ... |
| 17995 | 5996 | 0.275367 | 0.283664 | 92.21 |
| 17996 | 5997 | 0.274989 | 0.283700 | 92.22 |
| 17997 | 5998 | 0.275292 | 0.284007 | 92.24 |
| 17998 | 5999 | 0.273485 | 0.282479 | 92.30 |
| 17999 | 6000 | 0.274945 | 0.284541 | 92.21 |
18000 rows × 4 columns
#Plot the accuracy on all test data for every n iteration
import plotly
import plotly.express as px
import plotly.graph_objects as go
fig = px.line()
fig.update_layout(template = 'plotly_dark',legend=dict(title = 'Select epoch:',
yanchor="top",
y=0.25,
xanchor="left",
x=0.85), title = 'Test accuracy for every iteration')
fig.update_xaxes(title_text='Iterations')
fig.update_yaxes(title_text='Test accuracy')
subop = {'Epoch 1': df_iter['test_accuracy'][0:6000],
         'Epoch 2': df_iter['test_accuracy'][6000:12000],
         'Epoch 3': df_iter['test_accuracy'][12000:18000]}
for k, v in subop.items():
fig.add_scatter(x=v.index, y = v, name = k )
fig.show()
df_epoch = pd.DataFrame.from_dict(results_per_epoch)
df_epoch
| epoch | train_loss | test_loss | test_accuracy | |
|---|---|---|---|---|
| 0 | 1 | 0.298304 | 0.295644 | 91.57 |
| 1 | 2 | 0.278928 | 0.287999 | 92.00 |
| 2 | 3 | 0.274945 | 0.284541 | 92.21 |
For each class, visualize the 10 images that are misclassified with the highest score along with their predicted label and score. These are very confident wrong predictions.
Answer:
#extract misclassified labels
misc = []
a = np.argmax(Y_test, axis=0)
b = np.argmax(dict_f["l1"], axis=0)
for i in range(len(a)):
    if a[i] != b[i]:
        a_prob = dict_f["l1"][:, i][a[i]] #probability assigned to the actual class
        p_prob = np.amax(dict_f["l1"][:, i]) #probability assigned to the predicted class
        misc_i = {'i': i, 'a_label': a[i], 'a_prob': a_prob, 'p_label': b[i], 'p_prob': p_prob}
        misc.append(misc_i)
misc_df = pd.DataFrame.from_dict(misc)
misc_df
| i | a_label | a_prob | p_label | p_prob | |
|---|---|---|---|---|---|
| 0 | 2 | 5 | 0.180325 | 3 | 0.685709 |
| 1 | 15 | 8 | 0.117822 | 6 | 0.731380 |
| 2 | 26 | 5 | 0.211179 | 3 | 0.786947 |
| 3 | 28 | 4 | 0.014444 | 1 | 0.891716 |
| 4 | 31 | 5 | 0.084143 | 8 | 0.842130 |
| ... | ... | ... | ... | ... | ... |
| 774 | 9973 | 8 | 0.320126 | 9 | 0.555686 |
| 775 | 9977 | 2 | 0.106816 | 8 | 0.766568 |
| 776 | 9979 | 3 | 0.447184 | 0 | 0.487849 |
| 777 | 9986 | 8 | 0.046612 | 0 | 0.947382 |
| 778 | 9990 | 8 | 0.169839 | 4 | 0.398347 |
779 rows × 5 columns
#Top 1 misclassified per category
max_misc = misc_df.groupby('a_label')['p_prob'].max()
#print(max_misc)
max_misc_df = misc_df[misc_df['p_prob'].isin(max_misc)].sort_values('a_label')
print(max_misc_df)
| i | a_label | a_prob | p_label | p_prob | |
|---|---|---|---|---|---|
| 203 | 2716 | 0 | 0.006809 | 6 | 0.955316 |
| 700 | 9046 | 1 | 0.006577 | 6 | 0.976817 |
| 678 | 8729 | 2 | 0.000492 | 7 | 0.999123 |
| 106 | 1294 | 3 | 0.002487 | 2 | 0.997257 |
| 144 | 1948 | 4 | 0.009762 | 6 | 0.987989 |
| 254 | 3385 | 5 | 0.000138 | 6 | 0.999172 |
| 549 | 7204 | 6 | 0.004092 | 0 | 0.995046 |
| 671 | 8647 | 7 | 0.000739 | 2 | 0.998289 |
| 202 | 2712 | 8 | 0.000069 | 4 | 0.998524 |
| 529 | 6976 | 9 | 0.004846 | 4 | 0.994071 |
#Top 10 misclassified images per predicted category
##group by predicted label and sort predicted label scores
max_misc = misc_df.groupby(['p_label']).apply(lambda x: x.sort_values(['p_prob'], ascending = False))
max_misc.reset_index(drop = True, inplace = True)
##keep only top 10 scores
max_misc = max_misc.groupby('p_label').head(10)
max_misc['p_prob']=max_misc['p_prob'].round(4)
max_misc
| i | a_label | a_prob | p_label | p_prob | |
|---|---|---|---|---|---|
| 0 | 4729 | 5 | 0.000057 | 0 | 0.9982 |
| 1 | 7204 | 6 | 0.004092 | 0 | 0.9950 |
| 2 | 9592 | 6 | 0.012235 | 0 | 0.9877 |
| 3 | 1208 | 9 | 0.000834 | 0 | 0.9876 |
| 4 | 3357 | 4 | 0.000004 | 0 | 0.9863 |
| ... | ... | ... | ... | ... | ... |
| 708 | 594 | 4 | 0.040770 | 9 | 0.9021 |
| 709 | 7264 | 4 | 0.055380 | 9 | 0.8927 |
| 710 | 2325 | 4 | 0.085306 | 9 | 0.8855 |
| 711 | 4310 | 2 | 0.003603 | 9 | 0.8806 |
| 712 | 4367 | 4 | 0.123009 | 9 | 0.8526 |
100 rows × 5 columns
The following shows, for each class $i$, the images that are misclassified as class $i$, with the corresponding score on the left side of each image.
#visualize the highly misclassified images
import matplotlib.pyplot as plt
# create figure
fig = plt.figure(figsize=(20, 20))
rows = 10
columns = 10
for n in range(100):
    # add the (n+1)-th subplot in the 10x10 grid
    fig.add_subplot(rows, columns, n+1)
    ind = int(max_misc.iloc[n]['i'])
    im = X_test[:, ind].reshape((28,28), order='F')
    plt.imshow(im)
    plt.title('Predicted Label: ' + str(max_misc.loc[max_misc['i'] == ind, 'p_label'].iloc[0]))
    plt.ylabel('Score: ' + str(max_misc.loc[max_misc['i'] == ind, 'p_prob'].iloc[0]))
    plt.tick_params(left=False, right=False, labelleft=False,
                    labelbottom=False, bottom=False)
plt.show()
Please reduce the number of training data to 1 example per class (chosen randomly from the training data) and plot the curve (accuracy vs. iterations). The whole training set will be 10 images only.
#convert to dataframe
df = pd.DataFrame(labels_train, columns =['train_label'])
df['index'] = df.index
#random sampling
size = 1 # sample size
replace = True # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
df_rand = df.groupby('train_label', as_index=False).apply(fn)
df_rand
| train_label | index | ||
|---|---|---|---|
| 0 | 43577 | 0 | 43577 |
| 1 | 36932 | 1 | 36932 |
| 2 | 9841 | 2 | 9841 |
| 3 | 53134 | 3 | 53134 |
| 4 | 11712 | 4 | 11712 |
| 5 | 1615 | 5 | 1615 |
| 6 | 6165 | 6 | 6165 |
| 7 | 59774 | 7 | 59774 |
| 8 | 27124 | 8 | 27124 |
| 9 | 48915 | 9 | 48915 |
#use random indices to filter original training data
X_train_rs = np.array([X_train[:, index] for index in df_rand['index']])
X_train_rs = X_train_rs.T
X_train_rs.shape
(784, 10)
Y_train_rs = np.array([Y_train[:, index] for index in df_rand['index']])
Y_train_rs = Y_train_rs.T
Y_train_rs.shape
(10, 10)
#check filter result
labels_train[59255]
array([0], dtype=uint8)
Y_train_rs
array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
%%time
#Train
results_per_iter3, results_per_epoch3, dict_f3, dict_b3, params3 = mini_batch_fc(X_train_rs, Y_train_rs, X_test, Y_test, input_size = 28*28, output_size = 10, batch_size = 10, epoch = 3, rand_sample = True)
*** Epoch 0 *** Training done! *** Epoch 1 *** Training done! *** Epoch 2 *** Training done! CPU times: total: 250 ms Wall time: 64.8 ms
df_iter1 = pd.DataFrame.from_dict(results_per_iter3)
df_iter1
| iteration | train_loss | test_loss | test_accuracy | |
|---|---|---|---|---|
| 0 | 1 | 1.791778 | 2.216344 | 17.20 |
| 1 | 1 | 1.379278 | 2.114741 | 25.36 |
| 2 | 1 | 1.075647 | 2.036218 | 31.57 |
#Plot results
fig = px.line()
fig.update_layout(template = 'plotly_dark',legend=dict(title = 'Select epoch:',
yanchor="top",
y=0.25,
xanchor="left",
x=0.85), title = 'Test accuracy for every iteration')
fig.update_xaxes(title_text='Iterations')
fig.update_yaxes(title_text='Test accuracy')
subop = {'Epoch 1': df_iter1['test_accuracy'][0:1],
         'Epoch 2': df_iter1['test_accuracy'][1:2],
         'Epoch 3': df_iter1['test_accuracy'][2:3]}
for k, v in subop.items():
fig.add_scatter(x=v.index, y = v, name = k )
fig.show()
Note: since the whole training set has only 10 instances, the number of iterations per epoch with a batch size of 10 is only 1.
Try different mini-batch sizes (1, 10, 100) for the original case and plot the results. Which one is better and why?
Answer:
Since I already obtained results for batch size = 10 in part (1), I use batch sizes 1 and 100 below.
Batch size = 1
To save time, I'm only running this for 2 epochs.
%%time
#Train batch size=1
results_per_ite4a, results_per_epoch4a, dict_f4a, dict_b4a, params4a = mini_batch_fc(X_train, Y_train, X_test, Y_test, input_size = 28*28, output_size = 10, batch_size = 1, epoch = 2, rand_sample = True)
*** Epoch 0 *** Training 10000: training loss = 0.9546637752152007, test loss = 0.9358143878755834, test accuracy = 84.76 Training 20000: training loss = 0.8806656767540756, test loss = 0.8741450320052128, test accuracy = 86.17 Training 30000: training loss = 0.7634004504789411, test loss = 0.8054683434113198, test accuracy = 88.58 Training 40000: training loss = 0.8282999818547326, test loss = 0.8749206799428325, test accuracy = 87.66000000000001 Training 50000: training loss = 0.762895691864133, test loss = 0.8022676882358395, test accuracy = 88.53999999999999 Training 60000: training loss = 0.8669361741589076, test loss = 0.8844366358183816, test accuracy = 87.94999999999999 Training done! *** Epoch 1 *** Training 10000: training loss = 0.8185169450418242, test loss = 0.8519542665320383, test accuracy = 88.58 Training 20000: training loss = 0.9537208859597216, test loss = 1.0576902741717993, test accuracy = 85.28 Training 30000: training loss = 0.7452535705334342, test loss = 0.7561818257201999, test accuracy = 89.12 Training 40000: training loss = 0.6526562792931612, test loss = 0.7058862962016057, test accuracy = 90.79 Training 50000: training loss = 0.8486276846957119, test loss = 0.9006482851497486, test accuracy = 88.17 Training 60000: training loss = 0.801947417161944, test loss = 0.8802016562949941, test accuracy = 89.21 Training done! CPU times: total: 9h 18min 9s Wall time: 1h 48min 12s
df_iter2 = pd.DataFrame.from_dict(results_per_ite4a)
df_iter2
| iteration | train_loss | test_loss | test_accuracy | |
|---|---|---|---|---|
| 0 | 1 | 3.121829 | 3.155109 | 10.32 |
| 1 | 2 | 2.708463 | 2.726815 | 10.24 |
| 2 | 3 | 3.816171 | 3.855504 | 11.35 |
| 3 | 4 | 4.352786 | 4.410227 | 10.45 |
| 4 | 5 | 3.177153 | 3.201579 | 22.85 |
| ... | ... | ... | ... | ... |
| 119995 | 59996 | 0.801954 | 0.880207 | 89.21 |
| 119996 | 59997 | 0.801947 | 0.880202 | 89.21 |
| 119997 | 59998 | 0.801947 | 0.880202 | 89.21 |
| 119998 | 59999 | 0.801947 | 0.880202 | 89.21 |
| 119999 | 60000 | 0.801947 | 0.880202 | 89.21 |
120000 rows × 4 columns
#Plot results
fig = px.line()
fig.update_layout(template = 'plotly_dark',legend=dict(title = 'Select epoch:',
yanchor="top",
y=0.25,
xanchor="left",
x=0.85), title = 'Test accuracy for every iteration')
fig.update_xaxes(title_text='Iterations')
fig.update_yaxes(title_text='Test accuracy')
subop = {'Epoch 1': df_iter2['test_accuracy'][0:60000],
         'Epoch 2': df_iter2['test_accuracy'][60000:120000]}
for k, v in subop.items():
fig.add_scatter(x=v.index, y = v, name = k )
fig.show()
Batch size = 100
%%time
#Train batch size=100
results_per_iter4b, results_per_epoch4b, dict_f4b, dict_b4b, params4b = mini_batch_fc(X_train, Y_train, X_test, Y_test, input_size = 28*28, output_size = 10, batch_size = 100, epoch = 3, rand_sample = True)
*** Epoch 0 *** Training 100: training loss = 0.6148466417513304, test loss = 0.5910929348083898, test accuracy = 86.67 Training 200: training loss = 0.4925443093858635, test loss = 0.4700176354280208, test accuracy = 88.5 Training 300: training loss = 0.44590489595391625, test loss = 0.42565624646282363, test accuracy = 88.88000000000001 Training 400: training loss = 0.41421927993752855, test loss = 0.3937721639997466, test accuracy = 89.8 Training 500: training loss = 0.3956477517837601, test loss = 0.3776185653570958, test accuracy = 89.77000000000001 Training 600: training loss = 0.38068133522531516, test loss = 0.3629114440229983, test accuracy = 90.24 Training done! *** Epoch 1 *** Training 100: training loss = 0.36966974208364933, test loss = 0.35288598561614426, test accuracy = 90.64999999999999 Training 200: training loss = 0.3619074916867618, test loss = 0.34547338335220235, test accuracy = 90.77 Training 300: training loss = 0.3551972133200988, test loss = 0.3395805130693344, test accuracy = 90.86999999999999 Training 400: training loss = 0.34876157930517415, test loss = 0.33277139980258014, test accuracy = 91.09 Training 500: training loss = 0.343236119839324, test loss = 0.32840843255849345, test accuracy = 91.3 Training 600: training loss = 0.3384802777415614, test loss = 0.3237241705498349, test accuracy = 91.2 Training done! 
*** Epoch 2 *** Training 100: training loss = 0.3362124692184101, test loss = 0.3226848908023063, test accuracy = 91.07 Training 200: training loss = 0.33184906121395225, test loss = 0.31771433777198294, test accuracy = 91.14 Training 300: training loss = 0.32795963442747733, test loss = 0.31464079397526273, test accuracy = 91.47 Training 400: training loss = 0.32572288253606196, test loss = 0.31232695850658476, test accuracy = 91.53 Training 500: training loss = 0.3225478509666107, test loss = 0.3096462537886367, test accuracy = 91.53999999999999 Training 600: training loss = 0.3199326515325162, test loss = 0.30773874704328297, test accuracy = 91.55 Training done! CPU times: total: 8min 20s Wall time: 1min 46s
df_iter3 = pd.DataFrame.from_dict(results_per_iter4b)
df_iter3
| iteration | train_loss | test_loss | test_accuracy | |
|---|---|---|---|---|
| 0 | 1 | 2.230907 | 2.235815 | 18.61 |
| 1 | 2 | 2.125183 | 2.129207 | 27.80 |
| 2 | 3 | 2.041743 | 2.044509 | 37.72 |
| 3 | 4 | 1.955124 | 1.956896 | 42.51 |
| 4 | 5 | 1.879467 | 1.877801 | 53.50 |
| ... | ... | ... | ... | ... |
| 1795 | 596 | 0.319732 | 0.307941 | 91.41 |
| 1796 | 597 | 0.319625 | 0.307548 | 91.57 |
| 1797 | 598 | 0.319943 | 0.307623 | 91.45 |
| 1798 | 599 | 0.319669 | 0.307397 | 91.53 |
| 1799 | 600 | 0.319933 | 0.307739 | 91.55 |
1800 rows × 4 columns
#Plot results
fig = px.line()
fig.update_layout(template = 'plotly_dark',legend=dict(title = 'Select epoch:',
yanchor="top",
y=0.25,
xanchor="left",
x=0.85), title = 'Test accuracy for every iteration')
fig.update_xaxes(title_text='Iterations')
fig.update_yaxes(title_text='Test accuracy')
subop = {'Epoch 1': df_iter3['test_accuracy'][0:600],
         'Epoch 2': df_iter3['test_accuracy'][600:1200],
         'Epoch 3': df_iter3['test_accuracy'][1200:1800]}
for k, v in subop.items():
fig.add_scatter(x=v.index, y = v, name = k )
fig.show()
df_epoch4b = pd.DataFrame.from_dict(results_per_epoch4b)
df_epoch4b
| epoch | train_loss | test_loss | test_accuracy | |
|---|---|---|---|---|
| 0 | 1 | 0.380681 | 0.362911 | 90.24 |
| 1 | 2 | 0.338480 | 0.323724 | 91.20 |
| 2 | 3 | 0.319933 | 0.307739 | 91.55 |
Observation(s):
Based on the results I obtained, batch size 10 appears to be the best choice: it reaches a higher test accuracy at each epoch than batch size 100, even if it is slightly slower. Batch size 1 is both less accurate and much slower than batch sizes 10 and 100. One possible reason is that the larger batch size (100) takes fewer, coarser steps toward the optimal solution, so within the same number of epochs it is less likely to converge than the smaller batch size (10). Although batch size 10 oscillates more, i.e. its updates are noisier than with batch size 100, this noise often helps escape local minima in non-convex problems. With batch size 1 there is so much noise that convergence becomes even less likely.
Instead of using random sampling, sort the data before training so that all "1"s appear before "2"s and so on. Then sample sequentially when running SGD instead of sampling randomly. Does this work well? Why?
#Sort labels_train
df = pd.DataFrame(labels_train, columns =['train_label'])
df['index'] = df.index
df_sorted = df.sort_values('train_label')
df_sorted
| | train_label | index |
|---|---|---|
| 38142 | 0 | 38142 |
| 44820 | 0 | 44820 |
| 44815 | 0 | 44815 |
| 8331 | 0 | 8331 |
| 8330 | 0 | 8330 |
| ... | ... | ... |
| 54922 | 9 | 54922 |
| 33204 | 9 | 33204 |
| 18688 | 9 | 18688 |
| 28145 | 9 | 28145 |
| 38140 | 9 | 38140 |
60000 rows × 2 columns
#use sorted indices to reorder the image columns
X_train_sorted = X_train[:, df_sorted['index']]
X_train_sorted.shape
(784, 60000)
#use sorted indices to reorder the one-hot label columns
Y_train_sorted = Y_train[:, df_sorted['index']]
Y_train_sorted.shape
(10, 60000)
Y_train_sorted
array([[1., 1., 1., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 1., 1., 1.]])
%%time
#Train on the sorted data with sequential (non-random) sampling
results_per_iter5, results_per_epoch5, dict_f5, dict_b5, params5 = mini_batch_fc(
    X_train_sorted, Y_train_sorted, X_test, Y_test,
    input_size=28*28, output_size=10, batch_size=10, epoch=3, rand_sample=False)
*** Epoch 0 ***
Training 1000: training loss = 7.6560136185957965, test loss = 7.784962497036138, test accuracy = 19.82
Training 2000: training loss = 6.823384971637918, test loss = 6.962241353393071, test accuracy = 10.26
Training 3000: training loss = 8.046814939336956, test loss = 8.180553752444181, test accuracy = 9.86
Training 4000: training loss = 6.964878336372151, test loss = 7.144777356797537, test accuracy = 9.61
Training 5000: training loss = 6.004155071774029, test loss = 6.123391019369141, test accuracy = 10.06
Training 6000: training loss = 7.2011510893257045, test loss = 7.324521361934094, test accuracy = 10.09
Training done!
*** Epoch 1 ***
Training 1000: training loss = 4.877579851034547, test loss = 4.945013013043592, test accuracy = 21.35
Training 2000: training loss = 4.7513757967582615, test loss = 4.8532523920743085, test accuracy = 12.81
Training 3000: training loss = 5.792010286794085, test loss = 5.8955493727586035, test accuracy = 18.3
Training 4000: training loss = 4.952379765511491, test loss = 5.10419580281234, test accuracy = 12.28
Training 5000: training loss = 5.075148569930429, test loss = 5.16851943637267, test accuracy = 16.19
Training 6000: training loss = 5.849866551038217, test loss = 5.960995197913896, test accuracy = 10.53
Training done!
*** Epoch 2 ***
Training 1000: training loss = 4.006515437987527, test loss = 4.0482843741270615, test accuracy = 26.81
Training 2000: training loss = 4.105667161852119, test loss = 4.195502118034864, test accuracy = 16.39
Training 3000: training loss = 5.24883062562635, test loss = 5.347894459254287, test accuracy = 23.88
Training 4000: training loss = 4.339424038936531, test loss = 4.478378353012093, test accuracy = 18.38
Training 5000: training loss = 4.765526663587499, test loss = 4.855919003890725, test accuracy = 19.3
Training 6000: training loss = 5.398176060691625, test loss = 5.505538818901623, test accuracy = 13.91
Training done!
CPU times: total: 1h 25min 1s
Wall time: 16min 13s
df_iter5 = pd.DataFrame.from_dict(results_per_iter5)
df_iter5
| | iteration | train_loss | test_loss | test_accuracy |
|---|---|---|---|---|
| 0 | 1 | 3.957248 | 4.041774 | 9.80 |
| 1 | 2 | 4.027062 | 4.113994 | 9.80 |
| 2 | 3 | 4.100136 | 4.189165 | 9.80 |
| 3 | 4 | 4.120589 | 4.210263 | 9.80 |
| 4 | 5 | 4.167508 | 4.258917 | 9.80 |
| ... | ... | ... | ... | ... |
| 17995 | 5996 | 5.395582 | 5.502895 | 13.92 |
| 17996 | 5997 | 5.395785 | 5.503103 | 13.92 |
| 17997 | 5998 | 5.396005 | 5.503328 | 13.92 |
| 17998 | 5999 | 5.396370 | 5.503697 | 13.92 |
| 17999 | 6000 | 5.398176 | 5.505539 | 13.91 |
18000 rows × 4 columns
#Plot results
fig = px.line()
fig.update_layout(template='plotly_dark',
                  legend=dict(title='Select epoch:',
                              yanchor="top", y=0.85,
                              xanchor="left", x=0.95),
                  title='Test accuracy for every iteration')
fig.update_xaxes(title_text='Iterations')
fig.update_yaxes(title_text='Test accuracy')
subop = {'Epoch 1': df_iter5['test_accuracy'][0:6000],
         'Epoch 2': df_iter5['test_accuracy'][6000:12000],
         'Epoch 3': df_iter5['test_accuracy'][12000:18000]}
for k, v in subop.items():
    fig.add_scatter(x=v.index, y=v, name=k)
fig.show()
Observation(s):
Based on the results above, training on sorted data does not work well because each batch contains instances from (mostly) a single class. It is important to shuffle the data so that the variance of the updates is reduced and the model generalizes as well as possible. Here, since most batches contain instances of only one class, a training batch does not represent the overall distribution of the data; the resulting gradients are inaccurate and the accuracy drops. Moreover, if data point 11 always follows data point 10 in every epoch, its gradient update is biased by the model update that point 10 just produced. Hence, we shuffle the data so that the updates contributed by individual points or batches are independent.
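The shuffling argument above can be checked in isolation: permuting the feature columns and the labels with the same index vector keeps every sample aligned with its label. This is a minimal self-contained sketch with toy shapes and names (`X`, `y`, `perm` are local to this example, not from the notebook):

```python
import numpy as np

# Toy example: 6 samples stored column-wise, as in this notebook.
rng = np.random.default_rng(0)
X = rng.random((784, 6))
y = np.array([3, 1, 4, 1, 5, 9])  # toy class indices

# np.random.permutation draws every index exactly once (no replacement).
perm = rng.permutation(X.shape[1])
X_shuf, y_shuf = X[:, perm], y[perm]

# Each shuffled column still matches its original label.
for new_col, old_col in enumerate(perm):
    assert np.array_equal(X_shuf[:, new_col], X[:, old_col])
    assert y_shuf[new_col] == y[old_col]
```

The same one-liner pattern (`X[:, perm]`, `Y[:, perm]`) is what the per-epoch shuffle in the training loop relies on.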
(Bonus point) Add a hidden layer with 11 hidden neurons and ReLU activation function. Then, plot the accuracy curve to see how the accuracy changes.
Note: the bonus question adds two more layers (a linear layer and a ReLU activation layer). If you attempt it, you may want to implement back-propagation by deriving the derivative formula for each layer and hard-coding it into your Python code; you can then write code that multiplies these derivatives together to get the final gradient, as discussed in class. If you go this route, you can reuse the same code for both the bonus and non-bonus parts of the assignment.
Answer:
#relu function
def relu(r):
    return np.maximum(r, 0)

#relu derivative function: 1 where the input was positive, 0 elsewhere
def d_relu(Z):
    return (Z > 0).astype(float)

#softmax function; subtracting the column-wise max leaves the result
#unchanged but avoids overflow for large inputs
def softmax(x):
    exp = np.exp(x - x.max(axis=0, keepdims=True))
    return exp / np.sum(exp, axis=0)
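A quick self-contained sanity check of the softmax definition (matching the equation in the introduction): each output column should sum to one, with every element between 0 and 1, even for logits large enough that the naive `exp(x)/sum(exp(x))` would overflow. Toy values below are illustrative only:

```python
import numpy as np

# Two columns of logits; the second would overflow a naive softmax.
z = np.array([[2.0, 1000.0],
              [1.0, 1000.0],
              [0.1,  999.0]])
e = np.exp(z - z.max(axis=0, keepdims=True))  # max-shift for stability
p = e / np.sum(e, axis=0)

assert np.allclose(p.sum(axis=0), 1.0)  # each column is a distribution
assert np.all((0 <= p) & (p <= 1))      # valid probabilities
assert np.all(np.isfinite(p))           # no overflow despite logits of 1000
```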
#loss function (cross entropy loss)
def loss(Y, Y_pred):
    log_sum = -np.sum(np.multiply(Y, np.log(Y_pred)))  #summed cross entropy
    m = Y.shape[1]     #number of samples in this batch/set
    L = (1./m) * log_sum  #average over the m samples
    return L
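Because `gt` is one-hot, the averaged cross entropy computed by `loss()` reduces to the mean of $-\log P(c_y|x)$ over the batch, exactly as derived in the introduction. A tiny self-contained check with made-up probabilities:

```python
import numpy as np

# Two samples in a toy 3-class problem: true classes 0 and 1.
Y = np.array([[1., 0.],
              [0., 1.],
              [0., 0.]])
Y_pred = np.array([[0.7, 0.2],
                   [0.2, 0.5],
                   [0.1, 0.3]])

L = -np.sum(Y * np.log(Y_pred)) / Y.shape[1]      # as in loss() above
expected = (-np.log(0.7) - np.log(0.5)) / 2       # mean of -log P(c_y|x)
assert np.isclose(L, expected)
```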
#accuracy function: fraction of samples whose argmax prediction
#matches the argmax of the one-hot label
def accuracy(labels, predictions):
    m = labels.shape[1]
    predictions = np.argmax(predictions, axis=0)
    labels = np.argmax(labels, axis=0)
    total_correct = np.sum(predictions == labels)
    return total_correct / m * 100
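The accuracy computation can be verified on hand-made columns where the correct answer is known; the example below (toy values only) predicts three of four samples correctly:

```python
import numpy as np

# One-hot labels for classes 0, 1, 2, 2 (columns of the identity matrix).
labels = np.eye(3)[:, [0, 1, 2, 2]]
# Column-wise probabilities predicting classes 0, 1, 2, 0 -> 3/4 correct.
preds = np.array([[0.90, 0.10, 0.20, 0.60],
                  [0.05, 0.80, 0.30, 0.30],
                  [0.05, 0.10, 0.50, 0.10]])

acc = np.mean(np.argmax(preds, axis=0) == np.argmax(labels, axis=0)) * 100
assert acc == 75.0
```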
#forward propagation function
def forward(X, params):
    dict_f = {}  #store intermediate results for use in back-propagation
    # input layer to l1: z1 = w1*x + b1, then ReLU
    dict_f['z1'] = np.matmul(params['w1'], X) + params['b1']
    dict_f['l1'] = relu(dict_f['z1'])
    # l1 to output layer: z2 = w2*l1 + b2, then softmax
    dict_f['z2'] = np.matmul(params['w2'], dict_f['l1']) + params['b2']
    dict_f['l2'] = softmax(dict_f['z2'])  #class probabilities
    return dict_f
#backward propagation function
def backward(X, Y, params, dict_f, batch_size):
    dict_b = {}
    #output layer: for softmax + cross entropy, dL/dz2 = l2 - Y
    dz2 = dict_f['l2'] - Y
    dict_b['dw2'] = (1./batch_size) * np.matmul(dz2, dict_f['l1'].T)
    dict_b['db2'] = (1./batch_size) * np.sum(dz2, axis=1, keepdims=True)
    #hidden layer: back-propagate through w2, then through the ReLU
    dl1 = np.matmul(params['w2'].T, dz2)
    dz1 = dl1 * d_relu(dict_f['z1'])
    dict_b['dw1'] = (1./batch_size) * np.matmul(dz1, X.T)
    dict_b['db1'] = (1./batch_size) * np.sum(dz1, axis=1, keepdims=True)
    return dict_b
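Hand-derived gradients like the ones above are easy to get subtly wrong, so it is worth validating them against finite differences on a tiny problem. The sketch below checks the key identity used in `backward()` — that the gradient of the averaged softmax cross entropy with respect to the weights is $(1/N)(P - Y)X^T$ — on a toy single-layer net; all names here are local to this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4)) * 0.1            # 3 classes, 4 inputs
X = rng.normal(size=(4, 5))                  # 5 toy samples, column-wise
Y = np.eye(3)[:, rng.integers(0, 3, size=5)] # random one-hot labels

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def loss_of(W):
    P = softmax(W @ X)
    return -np.sum(Y * np.log(P)) / X.shape[1]

# analytic gradient: (1/N) * (softmax(WX) - Y) @ X.T
dW = (softmax(W @ X) - Y) @ X.T / X.shape[1]

# numeric gradient by central differences
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW_num[i, j] = (loss_of(Wp) - loss_of(Wm)) / (2 * eps)

assert np.allclose(dW, dW_num, atol=1e-6)
```

The same finite-difference trick extends to `w1` and the biases by perturbing each parameter and re-running `forward()` plus `loss()`.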
np.random.seed(11)
#initialize hyperparameters
m = X_train.shape[1]   #number of training samples
beta = 0.9             #momentum weight for the moving average, between 0 and 1 (higher = smoother updates)
learning_rate = 0.1
epoch = 3              #number of epochs
batch_size = 10
input_size = 28*28     #number of inputs
output_size = 10       #number of outputs = 10 digits
hidden_size = 11       #number of hidden neurons
batches = int(m / batch_size)  #number of batches per epoch
#initialize empty results lists
results_per_iter = []
results_per_epoch = []
#initialize parameters; scale weights by sqrt(1/fan_in) to keep the activation variance ~1/n
params = {"w1": np.random.randn(hidden_size, input_size) * np.sqrt(1. / input_size),
          "b1": np.zeros((hidden_size, 1)),
          "w2": np.random.randn(output_size, hidden_size) * np.sqrt(1. / hidden_size),
          "b2": np.zeros((output_size, 1))}
#initialize the exponential moving averages of the gradients at 0 (needed for momentum)
dw1_v = np.zeros(params["w1"].shape)
db1_v = np.zeros(params["b1"].shape)
dw2_v = np.zeros(params["w2"].shape)
db2_v = np.zeros(params["b2"].shape)
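The momentum rule the training loop applies, `v <- beta*v + (1-beta)*g` followed by `w <- w - lr*v`, can be sketched in isolation. With a constant gradient the moving average `v` converges to the gradient itself, so the update behaves like plain gradient descent after a warm-up; all names below are local to this toy sketch:

```python
import numpy as np

beta, lr = 0.9, 0.1
w = np.zeros(2)
v = np.zeros(2)
g = np.array([1.0, -1.0])  # pretend the gradient is constant

for _ in range(200):
    v = beta * v + (1 - beta) * g  # exponential moving average of gradients
    w = w - lr * v                 # descend along the smoothed direction

# v_t = g * (1 - beta**t), so v -> g, and w moves steadily against g
assert np.allclose(v, g, atol=1e-6)
assert w[0] < 0 < w[1]
```

In the real loop, `g` changes every batch, so `v` acts as a low-pass filter that damps the noise of small mini-batches.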
%%time
#loop over epochs, then over each batch of samples
np.random.seed(11)
for n in range(epoch):
    print("*** Epoch {} ***".format(n))
    #randomly permute the column indices once per epoch
    #(a permutation samples every index exactly once, i.e. without replacement)
    indices = np.random.permutation(m)
    X = X_train[:, indices]
    Y = Y_train[:, indices]
    #iteration counter
    count = 0
    #iterate through the m training samples in batches of batch_size
    for i in range(0, m, batch_size):
        #assign the i-th batch to variables
        X_i = X[:, i:i+batch_size]
        Y_i = Y[:, i:i+batch_size]
        #perform forward and backward pass
        dict_f = forward(X_i, params)
        dict_b = backward(X_i, Y_i, params, dict_f, batch_size)
        #update the moving averages of the gradients (momentum)
        dw1_v = (beta * dw1_v) + (1. - beta) * dict_b["dw1"]
        db1_v = (beta * db1_v) + (1. - beta) * dict_b["db1"]
        dw2_v = (beta * dw2_v) + (1. - beta) * dict_b["dw2"]
        db2_v = (beta * db2_v) + (1. - beta) * dict_b["db2"]
        #update parameters/weights using the GD update rule with a fixed learning rate
        params["w1"] = params["w1"] - learning_rate * dw1_v
        params["b1"] = params["b1"] - learning_rate * db1_v
        params["w2"] = params["w2"] - learning_rate * dw2_v
        params["b2"] = params["b2"] - learning_rate * db2_v
        #evaluate on the full training set
        dict_f = forward(X_train, params)
        train_loss = loss(Y_train, dict_f["l2"])
        #evaluate on the test set
        dict_f = forward(X_test, params)
        test_loss = loss(Y_test, dict_f["l2"])
        acc = accuracy(Y_test, dict_f["l2"])
        #save per-iteration results
        count += 1
        results_per_iter.append({'iteration': count, 'train_loss': train_loss,
                                 'test_loss': test_loss, 'test_accuracy': acc})
        #display progress every 1000 iterations
        if count % 1000 == 0:
            print("Training {}: training loss = {}, test loss = {}, test accuracy = {}".format(
                count, train_loss, test_loss, acc))
    #save per-epoch results
    dict_f = forward(X_train, params)
    train_loss = loss(Y_train, dict_f["l2"])
    dict_f = forward(X_test, params)
    test_loss = loss(Y_test, dict_f["l2"])
    acc = accuracy(Y_test, dict_f["l2"])
    results_per_epoch.append({'epoch': n + 1, 'train_loss': train_loss,
                              'test_loss': test_loss, 'test_accuracy': acc})
print("Training done!")
*** Epoch 0 ***
Training 1000: training loss = 0.38355553341135973, test loss = 0.3724378859108159, test accuracy = 89.01
Training 2000: training loss = 0.3593388943203853, test loss = 0.36250821527669, test accuracy = 89.19
Training 3000: training loss = 0.3173773357911038, test loss = 0.3171075609996651, test accuracy = 90.64
Training 4000: training loss = 0.3480446505682277, test loss = 0.3618698384653955, test accuracy = 89.9
Training 5000: training loss = 0.28684855488089483, test loss = 0.3016496329769622, test accuracy = 91.23
Training 6000: training loss = 0.2767988529906571, test loss = 0.28050895592373754, test accuracy = 91.71
*** Epoch 1 ***
Training 1000: training loss = 0.35641718682145784, test loss = 0.36632933057348877, test accuracy = 88.64
Training 2000: training loss = 0.3564540217338831, test loss = 0.3711754956476994, test accuracy = 89.44
Training 3000: training loss = 0.26416674736248996, test loss = 0.27520556760283676, test accuracy = 91.95
Training 4000: training loss = 0.2607346357944239, test loss = 0.2787991979786823, test accuracy = 91.84
Training 5000: training loss = 0.24823455414601278, test loss = 0.2590064420233197, test accuracy = 92.13
Training 6000: training loss = 0.24339353492441693, test loss = 0.2419777165580345, test accuracy = 92.77
*** Epoch 2 ***
Training 1000: training loss = 0.27255113699593736, test loss = 0.29599908356886795, test accuracy = 91.53
Training 2000: training loss = 0.23938062548106737, test loss = 0.24885206657253417, test accuracy = 92.92
Training 3000: training loss = 0.25739419730428764, test loss = 0.27214245862355396, test accuracy = 92.08
Training 4000: training loss = 0.26084625394861716, test loss = 0.272947114345603, test accuracy = 92.2
Training 5000: training loss = 0.240215044139242, test loss = 0.24878243191874272, test accuracy = 92.95
Training 6000: training loss = 0.24340524266235095, test loss = 0.27626885465156054, test accuracy = 91.89
Training done!
CPU times: total: 1h 38min 4s
Wall time: 19min 22s
df_iter6 = pd.DataFrame.from_dict(results_per_iter)
df_iter6
| | iteration | train_loss | test_loss | test_accuracy |
|---|---|---|---|---|
| 0 | 1 | 2.300597 | 2.301236 | 9.78 |
| 1 | 2 | 2.291695 | 2.292165 | 10.61 |
| 2 | 3 | 2.281327 | 2.281890 | 12.16 |
| 3 | 4 | 2.271067 | 2.271815 | 13.28 |
| 4 | 5 | 2.261570 | 2.262194 | 14.66 |
| ... | ... | ... | ... | ... |
| 17995 | 5996 | 0.232484 | 0.264917 | 92.38 |
| 17996 | 5997 | 0.235673 | 0.268251 | 92.31 |
| 17997 | 5998 | 0.238533 | 0.271150 | 92.17 |
| 17998 | 5999 | 0.240725 | 0.273587 | 91.99 |
| 17999 | 6000 | 0.243405 | 0.276269 | 91.89 |
18000 rows × 4 columns
#Plot results
fig = px.line()
fig.update_layout(template='plotly_dark',
                  legend=dict(title='Select epoch:',
                              yanchor="top", y=0.25,
                              xanchor="left", x=0.90),
                  title='Test accuracy for every iteration')
fig.update_xaxes(title_text='Iterations')
fig.update_yaxes(title_text='Test accuracy')
subop = {'Epoch 1': df_iter6['test_accuracy'][0:6000],
         'Epoch 2': df_iter6['test_accuracy'][6000:12000],
         'Epoch 3': df_iter6['test_accuracy'][12000:18000]}
for k, v in subop.items():
    fig.add_scatter(x=v.index, y=v, name=k)
fig.show()